Improving Inter-thread Data Sharing with GPU Caches
Not recorded
Abstract
The massive amount of fine-grained parallelism exposed by a GPU program makes it difficult to exploit the benefits of a shared cache, even when the program has good locality. The nondeterministic order of thread execution in the bulk synchronous parallel (BSP) model makes the situation even worse. Most prior work on exploiting GPU cache sharing focuses on regular applications that have linear memory access indices. In this paper, we formulate a generic workload partitioning model that systematically explores the complexity and approximation bound for optimal cache sharing among GPU threads. Our exploration demonstrates that it is possible to use the GPU cache efficiently without significant programming overhead or ad hoc, application-specific implementations.
Similar resources
ACACES A Decoupled Access/Execute Architecture for Mobile GPUs
Smartphones are emerging as one of the fastest growing markets, providing enhanced capabilities every few months. However, supporting these hardware/software improvements comes at the cost of reducing the operating time per battery charge. The GPU is only left with a shrinking fraction of the power budget, but the trend towards better screens will inevitably lead to a higher demand for improved...
A Cache-aware Thread Scheduling Policy for Multi-core Processors
A modern high-performance multi-core processor has large shared cache memories. However, simultaneously running threads do not always require the entire capacities of the shared caches. Besides, some threads cause severe performance degradation by inter-thread cache conflicts and shortage of capacity on the shared cache. To achieve high performance processing on multi-core processors, effective...
Architectural support for thread communications in multi-core processors
In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorp...
Understanding Multicore Cache Behavior of Loop-based Parallel Programs via Reuse Distance Analysis
Understanding multicore memory behavior is crucial, but can be challenging due to the cache hierarchies employed in modern CPUs. In today’s hierarchies, performance is determined by complex thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform simulation to sort out these interactions, but this can be costly ...
Data Movement Control for the PowerPC Architecture
There is an inherent cost for applications to access off-chip DRAM. A potential solution is a single large on-chip cache that all cores access with uniform latency. This architecture makes it easy for application developers to implement efficient inter-core sharing and use the entire on-chip cache. A large shared on-chip cache, however, is still prohibitively slow. Architects ensure each core h...